Poster Paper 



Proc. of Int. Conf. on Advances in Computer Science and Application 2013 



Performance Evaluation of Query Processing 
Techniques in Information Retrieval 

1 Prakasha S, 2 Shashidhar HR, 3 Dr. G T Raju, 4 Prajna Krishnan and 5 Shivaram K
1 Research Scholar, 2 Asst. Professor, CSE dept., 3 Professor & Head, CSE dept., 4, 5 PG student
RNS Institute of Technology, Bengaluru 560098
sprakashjpg@yahoo.co.in, shashi_dhara@yahoo.com, gtrajul990@yahoo.com, prajnakrishnan@yahoo.co.in,
shivaram.bhat41@gmail.com



Abstract - The first element of the search process is the query.
User queries are, on average, restricted to two or three
keywords, which makes them ambiguous to the search engine.
Given the user query, the goal of an Information Retrieval
[IR] system is to retrieve information that might be useful
or relevant to the information need of the user. Hence,
query processing plays an important role in an IR system.

Query processing can be divided into four categories,
i.e. query expansion, query optimization, query classification and
query parsing. In this paper an attempt is made to evaluate the
performance of query processing algorithms in each
category. The evaluation is based on the dataset specified by the
Forum for Information Retrieval [FIRE15]. The criteria used
for evaluation are precision and relative recall. The analysis is
based on the importance of each step in query processing. The
experimental results show the significance of each step
in query processing and also the relevance of web semantics
and spelling correction in the user query.

Key words - query expansion, query optimization, query 
parsing, web semantics and query classification 

I. Introduction 

The main purpose of an IR system is to retrieve all the
documents which are relevant to a user query while retrieving
as few non-relevant documents as possible. The typical
Information Retrieval (IR) model of the search process consists
of three essentials: query, documents and search results.
A user looking to fulfil an information need has to formulate
a query, usually consisting of a small set of keywords
summarizing the information need.

Queries are posed by interactive users and can be
ambiguous in nature. An interactive query goes through the
entire path of query parsing, expansion, optimization, and
processing [1].

The query processor turns user queries and data modification
commands into a query plan - a sequence of operations
(or algorithm) on the database from high level queries
to low level commands. Complex queries are becoming
commonplace with the growing use of decision support systems.
These complex queries often have many common
sub-expressions, either within a single query or across multiple
such queries run as a batch. Query optimization
aims at exploiting common sub-expressions to reduce evaluation
cost.

User queries generally suffer from low precision,
or low quality document retrieval. To overcome this
problem, scientists proposed methods to expand the
©2013 ACEEE
DOI: 03.LSCS.2013.3.559



original query with other topic-related terms extracted from
exogenous knowledge (e.g. ontology, WordNet, data mining) or
endogenous knowledge (i.e. extracted only from the documents
contained in the collection). Methods based on endogenous
knowledge, also known as relevance feedback, make use of a
number of labelled documents, provided by humans (explicit)
or by automatic/semi-automatic strategies, to extract
topic-related terms, and such methods have been shown to
improve performance [2].

A query parser, simply put, translates a user's search string
into specific instructions for the search engine. It stands
between users and the documents they are seeking, and so
its role in text retrieval is vital.

The Web is rich with various sources of information. It
contains the contents of documents, web directories,
multimedia data, user profiles and so on. The massive and
heterogeneous web document collections as well as the
unpredictable querying behaviours of typical web searchers
exacerbate Information Retrieval (IR) problems [13]. Hence
categorization of queries allows for increased effectiveness,
efficiency, and revenue potential in general-purpose web
search systems. Such categorization becomes critical if the
system is to return results not just from a general web
collection but from topic-specific databases as well [16].

One approach is to develop semantic Web services whereby
the Web services are annotated based on shared
ontologies, and to use these annotations for semantics-based
discovery of relevant Web services. Ontologies have
been identified as the basis for semantic annotation that can
be used for discovery. Ontologies are the basis for shared
conceptualization of a domain, and comprise concepts
with their relationships and properties. The use of ontologies to
provide underpinning for information sharing and semantic
interoperability has long been recognized. By mapping concepts
in a Web resource (whether data or Web service) to ontological
concepts, users can explicitly define the semantics of that
resource in that domain. An approach for semantic Web
service discovery is to have the ability to construct queries
using ontological concepts in a domain. This in turn requires
mapping concepts in Web service descriptions to ontological
concepts. When both the description and the query explicitly
declare their semantics, the results will be more relevant than
those of keyword-matching-based information retrieval [10].

In this paper performance analysis is done based on the
importance of each step in query processing, i.e. query
optimization, query expansion, query parsing and query
classification, by implementing some of the algorithms from
each of the mentioned categories. Experimental results show
the significance of each step in query processing. To
evaluate the quality and effectiveness of query processing,
the two basic measures for information retrieval are used.

Precision: The fraction of retrieved documents that are
relevant to the user query intent.

Recall: The fraction of the documents relevant to the user
query that are retrieved.

II. Related Work 

Amit Goyal et al. observe that the exponential growth in the
number of possible strategies as the number of relations
in a query increases has been identified as a major problem in
query optimization for relational databases. As the size
of a query grows, exhaustive search itself becomes
quite expensive. They modify the A* algorithm to produce a
randomized form of the algorithm and compare it with the
original A* algorithm and exhaustive search [4].

Yannis E. Ioannidis primarily discusses the core problems
in query optimization and their solutions, and only touches
upon the wealth of results that exist beyond that. More
specifically, the author concentrates on optimizing a single flat SQL
query with 'and' as the only Boolean connective in its
qualification (also known as conjunctive query, select-project-join
query, or non-recursive Horn clause) in a centralized
relational DBMS, assuming that full knowledge of the run-time
environment exists at compile time [6].

Prasan Roy et al. demonstrate that multi-query
optimization using heuristics is practical and provides
significant benefits. They propose three cost-based heuristic
algorithms: Volcano-SH and Volcano-RU, which are based on
simple modifications to the Volcano search strategy, and a
greedy heuristic [8].

Joseph M. Hellerstein defines a query cost framework
that incorporates both selectivity and cost estimates for
selections. He presents an algorithm called Predicate Migration
and proves that it produces optimal plans for queries with
expensive methods [12].

Francesco Colace et al. propose a query expansion
method to improve the accuracy of a text retrieval system. The
technique makes use of explicit relevance feedback to expand
an initial query with a structured representation called
Weighted Word Pairs. Such a structure can be automatically
extracted from a set of documents and uses a method for term
extraction based on the probabilistic Topic Model [2].

Samer Hassan et al. describe a new approach for
estimating term weights in a document, and show how the
new weighting scheme can be used to improve the accuracy
of a text classifier. The method uses term co-occurrence as a
measure of dependency between word features. A random-walk
model is applied on a graph encoding words and
co-occurrence dependencies, resulting in scores that represent
a quantification of how a particular word feature contributes
to a given context [5].

Mikio Yamamoto et al. describe techniques for working
with much longer n-grams. Suffix arrays were first
introduced to compute the frequency and location of a
substring (n-gram) in a sequence (corpus) of length N. To
compute frequencies over all N(N + 1)/2 substrings in a corpus,
the substrings are grouped into a manageable number of
equivalence classes. The paper uses these frequencies to
find "interesting" substrings. Lexicographers have been
interested in n-grams with high mutual information (MI), where
the joint term frequency is higher than what would be
expected by chance, assuming that the parts of the n-gram
combine independently. Residual inverse document
frequency (RIDF) compares document frequency to another
model of chance where terms with a particular term frequency
are distributed randomly throughout the collection. MI tends
to pick out phrases with non-compositional semantics (which
often violate the independence assumption) whereas RIDF
tends to highlight technical terminology, names, and good
keywords for information retrieval [7].
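
The two statistics described above can be illustrated with a minimal sketch. It assumes the common textbook forms: pointwise MI for a two-part n-gram with probabilities estimated as tf / N, and RIDF as the observed IDF minus the IDF predicted by a Poisson model. The function and parameter names are hypothetical and are not taken from [7], whose exact MI formulation for longer n-grams may differ.

```python
import math


def mutual_information(tf_xy, tf_x, tf_y, n):
    """Pointwise MI of a two-part n-gram: log2(P(xy) / (P(x) * P(y))).

    Probabilities are estimated as term frequency divided by corpus size n.
    """
    return math.log2((tf_xy / n) / ((tf_x / n) * (tf_y / n)))


def ridf(tf, df, d):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts.

    tf is the term frequency, df the document frequency and d the number
    of documents in the collection.
    """
    observed_idf = -math.log2(df / d)
    predicted_idf = -math.log2(1 - math.exp(-tf / d))
    return observed_idf - predicted_idf
```

Under these assumptions, a term whose occurrences are concentrated in a few documents (low df for its tf) receives a higher RIDF than a term spread evenly across the collection.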

Joshua Goodman proposes two new algorithms: the 
Labelled Recall Algorithm, which maximizes the expected 
Labelled Recall Rate, and the Bracketed Recall Algorithm, 
which maximizes the Bracketed Recall Rate [9]. 

Adrian D. Thurston et al. present two enhancements to
a basic backtracking LR approach which enable the parsing
of computer languages that are both context-dependent and
ambiguous [11].

Kaarthik Sivashanmugam et al. develop semantic Web
services whereby the Web services are annotated based on
shared ontologies, and use these annotations for semantics-based
discovery of relevant Web services. They discuss
one such approach that involves adding semantics to WSDL
using DAML+OIL ontologies, and they also apply the approach
to UDDI to store these semantic annotations and search for
Web services based on them [10].

Graeme Hirst et al. propose a method for detecting and
correcting real-word spelling errors by identifying tokens that are
semantically unrelated to their context and are spelling
variations of words that would be related to the context.
Relatedness to context is determined by a measure of
semantic distance [3].

III. Query Optimization 

Query optimization is a function of many relational
database management systems in which multiple query
plans for satisfying a query are examined and a good query
plan is identified. The improved A* algorithm, when used for
query optimization, gives output comparable to exhaustive
search while using a minimal amount of search space. The improved
A* algorithm uses two linked lists instead of the one used in the
original A* algorithm. It also considers global costs. The algorithm
for optimizing n relations performs a large number of local
optimizations. Each one starts at a random node and
repeatedly accepts random downhill moves until it reaches a
local minimum; the algorithm then returns the local minimum
with the lowest cost found.
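
The local-optimization loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation evaluated in this paper: a plan is a left-deep join order, a move swaps two relations, and `plan_cost` is a hypothetical toy cost (the sum of intermediate result sizes with one fixed selectivity). A local minimum is approximated by stopping after `patience` consecutive moves that fail to go downhill.

```python
import random


def plan_cost(order, cardinality, selectivity=0.01):
    """Toy cost of a left-deep join order: the sum of intermediate result sizes."""
    size = cardinality[order[0]]
    total = 0.0
    for rel in order[1:]:
        size = size * cardinality[rel] * selectivity
        total += size
    return total


def random_neighbor(order, rng):
    """Return a copy of `order` with two randomly chosen positions swapped."""
    i, j = rng.sample(range(len(order)), 2)
    neighbor = list(order)
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor


def iterative_improvement(cardinality, restarts=20, patience=50, seed=0):
    """Run several local optimizations and keep the cheapest local minimum."""
    rng = random.Random(seed)
    relations = sorted(cardinality)
    best_order, best_cost = None, float("inf")
    for _ in range(restarts):
        # Each local optimization starts at a random plan.
        current = relations[:]
        rng.shuffle(current)
        current_cost = plan_cost(current, cardinality)
        failures = 0
        while failures < patience:
            candidate = random_neighbor(current, rng)
            candidate_cost = plan_cost(candidate, cardinality)
            if candidate_cost < current_cost:
                # Accept only downhill moves.
                current, current_cost = candidate, candidate_cost
                failures = 0
            else:
                failures += 1
        if current_cost < best_cost:
            best_order, best_cost = current, current_cost
    return best_order, best_cost
```

For example, `iterative_improvement({"A": 1000, "B": 10, "C": 500, "D": 50})` returns a join order together with its toy cost; with a real optimizer the cost function would come from the system's own statistics.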

Three cost-based heuristic algorithms are considered: Volcano-SH and
Volcano-RU, which are based on simple modifications to the
Volcano search strategy, and a greedy heuristic. The greedy
heuristic incorporates novel optimizations that improve
efficiency greatly.

To optimize a tree, all predicates are pushed down as far
as possible, and then the Series-Parallel Algorithm Using
Parallel Chains is applied repeatedly to each stream in the tree
until no more progress can be made.

Table I: Query Optimization

| Optimization Algorithms | Input | Output | Precision | Recall |
|---|---|---|---|---|
| A* algorithm [4] | Output of original A* algorithm containing total cost of node to reach the parent node | Total cost in global costs | 17.29% | 28.30% |
| Algorithm for optimizing n relations [8p] | A query of N relations | The unique set of N relations joined in the query is generated from the plans; the cheapest plan/path is the final output | 21.3% | 41.73% |
| Volcano-SH | Directed acyclic graph of queries | Decides in a cost-based manner which of the nodes to materialize and share | 11.85% | 19.45% |
| Volcano-RU | Batch of queries | Optimized queries | 15.36% | 20.39% |
| Greedy heuristic | Expanded DAG for the consolidated input query | Set of nodes to materialize and the corresponding best plan | 19.98% | 34.38% |
| Predicate Migration [12] | Tree | Optimally place expensive predicates in a query plan | 27.37% | 29.76% |



From Table I we infer that the Algorithm for
optimizing n relations performs well on both basic IR measures,
precision and recall. The algorithm shows
a recall of 41.73%, the highest in Table I, and
a precision of 21.3%, which is second only to that of
Predicate Migration (27.37%).

IV. Query Expansion 

In WWP feature selection the aim is to extract from a set
of documents a compact representation, named Weighted
Word Pairs (WWP), which contains the most discriminative
word pairs to be used in the text retrieval task. The method has
two stages. In the Relations Learning stage, graph relation
weights are learned by computing probabilities between word
pairs. In the Structure Learning stage, an initial WWP graph,
which contains all possible relations between aggregate roots and
aggregates, is optimized by performing an iterative procedure.
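
To make the idea of weighting word pairs by probability concrete, here is a deliberately simplified sketch. It is an assumption-laden stand-in, not the method of [2]: instead of a probabilistic topic model, it estimates the weight of a pair as the fraction of documents in which both words occur, and it keeps only the `top_k` heaviest pairs. The function name and parameters are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_word_pairs(documents, top_k=3):
    """Weight each word pair by the fraction of documents containing both words."""
    pair_counts = Counter()
    for doc in documents:
        # Count each unordered pair at most once per document.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    n_docs = len(documents)
    weights = {pair: count / n_docs for pair, count in pair_counts.items()}
    # Heaviest pairs first; ties are broken alphabetically for determinism.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

The selected pairs could then be appended to the original query as expansion terms; how the pairs are combined with the query is left open here.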

In query expansion, when the WWP feature selection
Relations Learning stage is implemented together with the
Structure Learning stage, there is a considerable improvement
in both the precision and recall parameters of IR. It can be
observed in Table II that precision is higher by almost
22 percentage points and recall by more than 20 percentage
points, which more than doubles it.

Table II: Query Expansion

| Query Expansion | Input | Output | Precision | Recall |
|---|---|---|---|---|
| WWP feature selection: Relations Learning stage | Set of documents | Vector of weighted word pairs g; discriminative word pairs to be used in the text retrieval task | 26.29% | 41.63% |
| The Structure Learning stage | A starting WWP structure | The set of parameters which produces the best WWP graph | 4.49% | 18.90% |




V. Query parsing 

The Labelled Recall algorithm maximises the expected labelled
recall rate, and the Bracketed Recall algorithm maximises the
bracketed recall rate.

Table III: Query parsing

| Query Parsing | Input | Output | Precision | Recall |
|---|---|---|---|---|
| Labelled Recall algorithm [11] | Tree | MAXC(Ln) contains the score of the best parse, where n is the length of the tree | 10.94% | 18.45% |
| Bracketed Recall [11] | Tree T_g | Max-g, bracketed recall rate | 9.80% | 21.48% |




In query parsing, when the Labelled Recall algorithm is
implemented with the Bracketed Recall algorithm, the recall
parameter improves by approximately 3 percentage points, while
the precision parameter remains almost the same, as
shown in Table III.



VI. Query Classification 

The task of query classification is to assign a Web search 
query to one or more predefined categories, based on its 
topics. 

The random-walk approach uses term co-occurrence as a measure of
dependency between word features. The random-walk model
is applied on a graph encoding words and co-occurrence
dependencies, resulting in scores that represent a
quantification of how a particular word feature contributes
to a given context.
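
A minimal sketch of such a random walk is given below. It assumes an undirected co-occurrence graph built with a sliding window and a PageRank-style update with a damping factor, iterated until the change for every vertex falls below a threshold. The window size, damping value and function names are illustrative assumptions and are not taken from [5].

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def random_walk_scores(neighbors, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate a PageRank-style update until every vertex changes by less than tol."""
    scores = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_scores = {}
        for w in neighbors:
            # Each neighbor passes on an equal share of its current score.
            rank = sum(scores[v] / len(neighbors[v]) for v in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        converged = max(abs(new_scores[w] - scores[w]) for w in neighbors) < tol
        scores = new_scores
        if converged:
            break
    return scores
```

The resulting scores can replace raw term frequency as term weights; words that co-occur with many other words in the context receive the highest scores.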

Term frequency (tf) is the standard notion of frequency
in corpus-based natural language processing (NLP); it counts
the number of times that a type (term/word/n-gram) appears
in a corpus. Document frequency (df) counts the number of
documents that contain a type at least once. Term frequency
is an integer between 0 and N; document frequency is an
integer between 0 and D, the number of documents in the
corpus.

An algorithm based on suffix arrays is used for computing
tf and df and many functions of these quantities for all
substrings in a corpus in O(N log N) time, even though there
are N(N + 1)/2 such substrings in a corpus of size N. The
algorithm groups the N(N + 1)/2 substrings into at most
2N - 1 equivalence classes. By grouping substrings in this way,
many of the statistics of interest can be computed over the
relatively small number of classes, which is manageable, rather
than over the quadratic number of substrings, which would
be prohibitive.
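
The role of the suffix array can be illustrated with a naive sketch. It is not the O(N log N) class-based algorithm of [7]: the array is built by sorting whole suffixes, and tf and df are computed for one substring at a time by binary search. The helper names and the newline separator between documents are assumptions made for the example.

```python
def build_corpus(documents, separator="\n"):
    """Concatenate documents and record which document each position belongs to."""
    text = separator.join(documents)
    doc_of_position = []
    for doc_id, doc in enumerate(documents):
        doc_of_position.extend([doc_id] * (len(doc) + len(separator)))
    return text, doc_of_position[: len(text)]


def build_suffix_array(text):
    """Sort all suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, suffixes, pattern, upper):
    """Binary search for the first suffix whose prefix is >= (or > if upper) pattern."""
    lo, hi = 0, len(suffixes)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[suffixes[mid] : suffixes[mid] + len(pattern)]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def term_frequency(text, suffixes, pattern):
    """Number of occurrences of pattern: size of its interval in the suffix array."""
    return _bound(text, suffixes, pattern, True) - _bound(text, suffixes, pattern, False)


def document_frequency(text, suffixes, doc_of_position, pattern):
    """Number of distinct documents that contain pattern at least once."""
    lo = _bound(text, suffixes, pattern, False)
    hi = _bound(text, suffixes, pattern, True)
    return len({doc_of_position[suffixes[k]] for k in range(lo, hi)})
```

All suffixes that start with the same substring are adjacent in the sorted array, so tf is simply the width of that interval; this adjacency is what the class-based algorithm exploits to avoid visiting every substring.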

Table IV: Query Classification

| Query Classification | Input | Output | Precision | Recall |
|---|---|---|---|---|
| Random-walk model [5] | G = (V, E), the set of vertices that point to vertex Va (predecessors), and Out(Va), the set of vertices that vertex Va points to (successors) | The algorithm terminates when the convergence point is reached for all the vertices, meaning that the error rate for each vertex falls below a pre-defined threshold | 20.19% | 31.02% |
| To compute term frequency [7] | Suffix arrays | Term frequency (tf) | 26.45% | 41.75% |
| To compute term document frequency (df) | Suffix arrays | Document frequency (df) | 13.45% | 19.34% |


In the comparison of query classification algorithms
in Table IV, when the compute term frequency (tf) algorithm is
implemented along with the compute term document frequency
(df) algorithm, there is a remarkable increase in the recall and
precision parameters. As Table IV shows, precision is higher by
about 13 percentage points (26.45% versus 13.45%) and recall by
about 22 percentage points (41.75% versus 19.34%).

VII. Spelling Errors 

Spelling errors that happen to result in a real word in the
lexicon cannot be detected by a conventional spelling checker.
A method is presented for detecting and correcting many
such errors by identifying tokens that are semantically
unrelated to their context and are spelling variations of words
that would be related to the context. Relatedness to context
is determined by a measure of semantic distance.

The algorithm for detecting and correcting malapropisms
shows some improvement in recall and precision when
implemented alone. When it is implemented with other
algorithms, it can make a considerable difference in the recall
and precision parameters of query processing.
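
The detection idea can be sketched in a few lines under strong simplifying assumptions. Here semantic relatedness is a hypothetical lookup table of related word pairs rather than a semantic-distance measure, and a spelling variation is any lexicon word exactly one insertion, deletion or substitution away. None of these names or choices come from [3].

```python
def is_one_edit_away(a, b):
    """True if b differs from a by exactly one insertion, deletion or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1 :] == b[i + 1 :]  # one substitution
    return a[i:] == b[i + 1 :]  # one extra character in b


def related(word, context, relatedness):
    """True if word is related to at least one other word of the context."""
    return any(frozenset((word, c)) in relatedness for c in context if c != word)


def find_malapropisms(tokens, lexicon, relatedness):
    """Raise an alarm for unconfirmed words that have a related spelling variation."""
    alarms = []
    for i, word in enumerate(tokens):
        context = tokens[:i] + tokens[i + 1 :]
        if related(word, context, relatedness):
            continue  # the word is confirmed by its context
        candidates = [
            v
            for v in sorted(lexicon)
            if is_one_edit_away(word, v) and related(v, context, relatedness)
        ]
        if candidates:
            alarms.append((word, candidates))
    return alarms
```

In the sentence "the boat sailed across the like", the word "like" is unrelated to its context while its spelling variation "lake" is related to "boat", so the sketch flags "like" and suggests "lake".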

Table V: Spelling errors

| Spelling errors | Input | Output | Precision | Recall |
|---|---|---|---|---|
| Algorithm for detecting and correcting malapropisms [3] | consideration all words in the text that | Mark a word as confirmed or unconfirmed; raise an alarm | | 8.10% |

VIII. Web Semantics

The Semantic Web is a collaborative movement led by
the international standards body, the World Wide Web
Consortium (W3C). The standard promotes common data
formats on the World Wide Web [17]. By encouraging the
inclusion of semantic content in web pages, the Semantic
Web aims at converting the current web, dominated by
unstructured and semi-structured documents, into a "web of
data". The Semantic Web stack builds on the W3C's
Resource Description Framework (RDF).

Adding semantics in WSDL aims to develop semantic Web
services whereby the Web services are annotated based on
shared ontologies, and to use these annotations for
semantics-based discovery of relevant Web services. One such
approach adds semantics to WSDL using DAML+OIL
ontologies. Adding semantics in UDDI is used to store these
semantic annotations and to search for Web services based on
them.

In Semantic Web Service Discovery, the semantic annotations
added in WSDL and in UDDI are aimed at improving discovery
and composition of services. Discovery involves ranking based
on the semantic similarity between the precondition and effect
concepts of the selected operations and the precondition and
effect concepts of the template [10].



Table VI: Semantic Web Table

| Web Semantics | Input | Output | Precision | Recall |
|---|---|---|---|---|
| Adding Semantics in WSDL | Message parts using XML schema constructs | Mapping message parts to ontological concepts using XML schema constructs | 39.46% | 51.67% |
| Adding Semantics in UDDI | The second tModel represents the ontologies of input | The third tModel represents the ontologies of output | 29.85% | 48.07% |
| Semantic Web Service Discovery | Phase 1: Web services (operations in different WSDL files) | Matches Web services functionality provided | 24.01% | 39.10% |
| | Phase 2: result set from the first phase | Ranked first phase result set on the basis of semantic similarity between the input and output concepts | | |
| | Phase 3: result set from the first phase | Ranking based on the semantic similarity between the precondition and effect concepts of the selected operations and the precondition and effect concepts of the template | | |




From Table VI it is observed that when
semantics is added to WSDL, UDDI and Web Service
Discovery there is a considerable increase in the precision and
recall parameters. Hence adding web semantics to the query
shows a considerable improvement in IR performance.



IX. Experimental Evaluation 

The experiments were conducted on the FIRE system; all
the algorithms in each category of query processing, web
semantics and spelling errors in queries were implemented in
Java. We used a collection of the TREC WebTrack [13] and
measured precision and recall over a set from the Indian Statistical
Institute, Kolkata. Precision was measured by precision at 10
(P@10) and mean average precision [14].
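
For reference, the two rank-based measures named above can be computed as in the following sketch. It uses the standard definitions of precision at k and mean average precision; the function names and the representation of a run as a ranked list plus a set of relevant document ids are assumptions made for the example.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document found."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Relevant documents that are never retrieved contribute zero to the average precision of a query, so the measure rewards both ranking quality and coverage.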

Precision is the proportion of relevant documents among
the retrieved documents.

Precision is defined as follows:

$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

The method for estimating the recall of the algorithms is
counting the number of full evaluations required to return a
certain number of top results. This measure has the advantage
of being independent of the software and the hardware
environment, as well as the quality of the implementation.

Recall is the proportion of retrieved documents among
the relevant documents.

Recall is defined as follows:

$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
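
The set-based definitions of precision and recall translate directly into code. The sketch below is only an illustration; the function name and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, retrieving four documents of which two are relevant, out of three relevant documents overall, gives a precision of 2/4 and a recall of 2/3.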




Fig 1: Results of Query Optimization

In query optimization, Fig 1 shows that the Algorithm for
optimizing n relations provides good results in both IR
criteria, precision and recall, when compared to the other
implemented algorithms.

In query expansion, Fig 2 shows that when the WWP Feature
Selection Relations Learning Stage is implemented along with the
Structure Learning Stage, there is an increase
in both the precision and recall parameters of IR.

In query parsing, Fig 3 shows that when the Labelled Recall
algorithm is implemented along with the Bracketed Recall
algorithm, the recall parameter increases considerably.

In query classification, the Compute Term Document Frequency
algorithm is implemented along with the Compute
Term Frequency algorithm. Fig 4 shows that this improves
both the precision and recall metrics.

For the algorithm for detecting and correcting malapropisms,
Fig 5 shows a marginal increase in precision and recall.
When implemented with the other query
processing categories, it can help increase the overall precision
and recall parameters.

In web semantics, Fig 6 shows that
adding semantics to WSDL, UDDI and Web Service Discovery
yields a visible increase in the precision and recall parameters.

Fig 2: Results of Query Expansion (bars: WWP Feature Selection Relations Learning Stage; The Structure Learning Stage)

Fig 3: Results of Query Parsing

Fig 4: Results of Query Classification (bars: Random-Walk Model; Compute Term Frequency; Compute Term Document Frequency; legend: Series 1, Series 2)

Fig 5: Results of Spelling Errors

Fig 6: Results of Web Semantics

For approximate query processing, web semantics and
spelling error strategies (where false negatives are possible)
we measure two parameters. First, we measure the performance
gain, as before. Second, we measure the change in recall and
precision by looking at the distance between the set of
original results and the set of results produced by the
implemented algorithms.
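
The text does not say which distance is used between the original result set and the result set produced by an implemented algorithm. As one possible choice, the sketch below assumes the Jaccard distance between the two sets; the function name is hypothetical.

```python
def jaccard_distance(original_results, new_results):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets, 1.0 means disjoint sets."""
    a, b = set(original_results), set(new_results)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)
```

A distance near 0 indicates that the approximate strategy returns almost the same documents as the original query, so little recall or precision is lost.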

X. Conclusion and Future Work

The implementation of various categories of query
processing demonstrates that using query processing at
various stages of IR can yield substantial gains in efficiency
at no loss in precision and recall. The motivation for
query processing is to handle the incompleteness of user
queries caused by increasingly ambiguous input from the user. In
this context, a related issue is handling query imprecision:
most users of online databases tend to pose imprecise queries
which admit answers with varying degrees of relevance.

We have described the experiments that empirically
validate our approaches in various stages of query
processing. Specifically, the experiments substantiate two
premises.

- In order to estimate recall and precision we need to
identify the set of values in a sample that represent a single
real world object. In our approach these sets correspond to
the result of the query processing. The experiments show
that the result of the query processing may be used to improve
the efficiency of the search process.

- When the sample is big enough and the similarity metrics
are adequate for the column domain, the recall/precision
results are very similar to the actual recall/precision values
obtained when querying the database.

However, several problems are still open. In particular, the size
of the samples obviously affects the results of the estimation
process. In future we can work on larger datasets and also
find the minimum dataset size required to estimate
precision and recall. A wider variety of algorithms can be
evaluated under the mentioned categories to get a complete
picture of query processing.
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